Benchmarking Single- and Multi-Core BLAS Implementations and GPUs for use with R
Author
Abstract
We provide timing results for common linear algebra subroutines across BLAS (Basic Linear Algebra Subprograms) and GPU (Graphics Processing Unit)-based implementations. Several BLAS implementations are compared. The first is the unoptimised reference BLAS, which provides a baseline to measure against. Second is the Atlas tuned BLAS, configured for single-threaded mode. Third is the development version of Atlas, configured for multi-threaded mode. Fourth is the optimised and multi-threaded Goto BLAS. Fifth is the multi-threaded BLAS contained in the commercial Intel MKL package. We also measure the performance of a GPU-based implementation for R (R Development Core Team 2010a) provided by the package gputools (Buckner et al. 2010). Several frequently used linear algebra computations are compared across BLAS (and LAPACK) implementations and via GPU computing: matrix multiplication as well as the QR, SVD and LU decompositions. The tests are performed from an end-user perspective, and ‘net’ times (including all necessary data transfers) are compared. While results are by their very nature dependent on the hardware of the test platforms, a few general lessons can be drawn. Unsurprisingly, accelerated BLAS clearly outperform the reference implementation. Similarly, multi-threaded BLAS hold a clear advantage over single-threaded BLAS when used on modern multi-core machines. Among the multi-threaded BLAS implementations, Goto is seen to have a slight advantage over MKL and Atlas. GPU computing shows promise but requires relatively large matrices to outperform multi-threaded BLAS. We also provide the framework for computing these benchmark results in the corresponding R package gcbd. It enables other researchers to compute similar benchmark results, which could form the basis for heuristics that help select optimal computing strategies for a given platform, library, problem and size combination.
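As a rough illustration of the end-user timing approach described in the abstract, the following minimal R sketch times the four operations with base R's system.time(). It is not the interface of the gcbd package itself; the helper names timeIt and benchmarkOps are made up for this example, and only base R plus the Matrix package (for lu()) are assumed.

# Minimal sketch of an end-user timing harness; not gcbd's actual API.
# Helper names (timeIt, benchmarkOps) are illustrative only.
library(Matrix)                      # provides lu() for the LU decomposition

timeIt <- function(expr, reps = 3) {
    e <- substitute(expr)            # capture the expression unevaluated
    env <- parent.frame()
    # median elapsed ('net') wall-clock time over 'reps' fresh evaluations
    median(replicate(reps, system.time(eval(e, env))[["elapsed"]]))
}

benchmarkOps <- function(n) {
    A <- matrix(rnorm(n * n), n, n)
    B <- matrix(rnorm(n * n), n, n)
    c(matmult = timeIt(A %*% B),     # DGEMM, dispatched to the linked BLAS
      qr      = timeIt(qr(A)),       # QR decomposition (LAPACK)
      svd     = timeIt(svd(A)),      # singular value decomposition (LAPACK)
      lu      = timeIt(lu(A)))       # LU decomposition via Matrix
}

# Whichever BLAS R is linked against (reference, Atlas, Goto, MKL, ...)
# is what gets measured; a GPU run would substitute gputools calls such
# as gputools::gpuMatMult(A, B) instead of A %*% B.
sapply(c(250, 500, 1000), benchmarkOps)

Timing at this level deliberately measures what an R user actually experiences, including any data copies a backend performs, which matches the ‘net’ comparison made in the paper.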
Similar Resources
MPI- and CUDA- implementations of modal finite difference method for P-SV wave propagation modeling
Among different discretization approaches, the Finite Difference Method (FDM) is widely used for acoustic and elastic full-waveform modeling. An inevitable drawback of the technique, however, is its severe demand on computational resources. A promising solution is parallelization, where the problem is broken into several segments and the calculations are distributed over different processors. ...
Batched matrix computations on hardware accelerators based on GPUs
Scientific applications require solvers that work on many small-size problems that are independent of each other. At the same time, high-end hardware evolves rapidly and becomes ever more throughput-oriented, and thus there is an increasing need for an effective approach to developing energy-efficient, high-performance code for these small matrix problems, which we call batched factorizations....
An Improved MAGMA GEMM for Fermi GPUs
We present an improved matrix-matrix multiplication routine (GEMM) in the MAGMA BLAS library that targets the Fermi GPUs. We show how to modify the previous MAGMA GEMM kernels in order to make more efficient use of Fermi's new architectural features, most notably its extended memory hierarchy and sizes. The improved kernels run at up to 300 GFlop/s in double and up to 600 GFlop/s in sin...
Accelerating high-order WENO schemes using two heterogeneous GPUs
A double-GPU code is developed to accelerate WENO schemes. The test problem is a compressible viscous flow. The convective terms are discretized using third- to ninth-order WENO schemes, and the viscous terms are discretized by the standard fourth-order central scheme. The code, written in the CUDA programming language, is developed by modifying a single-GPU code. The OpenMP library is used for parall...
متن کاملMAGMA: A Breakthrough in Solvers for Eigenvalue Problems
The Matrix Algebra on GPU and Multicore Architectures (MAGMA) project aims to develop the next generation of LAPACK/ScaLAPACK-compliant linear algebra libraries for heterogeneous multicore-GPU architectures. The functionality covered in LAPACK and ScaLAPACK is a fundamental building block for many scientific computing applications, underlining the importance and potential for broad impact in de...
MAGMA: A Breakthrough in Solvers for Eigenvalue Problems
Publication date: 2010